Efficient constraint monitoring using adaptive thresholds

ABSTRACT

Methods for tracking anomalous behavior in a network referred to as non-zero slack schemes are provided. The non-zero slack schemes reduce the number of communication messages in the network necessary to monitor emerging large-scale, distributed systems using distributed computation algorithms by generating more optimal local constraints for each remote site in the system.

PRIORITY STATEMENT

This non-provisional patent application claims priority under 35 U.S.C. §119(e) to provisional patent application Ser. No. 60/993,790, filed on Jun. 8, 2007, the entire contents of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

When monitoring emerging large-scale, distributed systems (e.g., peer-to-peer systems, server clusters, Internet Protocol (IP) networks, sensor networks and the like), network monitoring systems must process large volumes of data in (or near) real-time from a widely distributed set of sources. For example, in a system that monitors a large network for distributed denial of service (DDoS) attacks, data from multiple routers must be processed at a rate of several gigabits per second. In addition, the system must detect attacks immediately after they happen (e.g., with minimal latency) to enable network operators to take expedient countermeasures to mitigate the effects of these attacks.

Conventionally, algorithms for tracking and computing wide ranges of aggregate statistics over distributed data streams are used to process these large volumes of data. These algorithms apply to a general class of continuous monitoring applications in which the goal is to optimize the operational resource usage, while still guaranteeing that the estimate of the aggregate function is within specified error bounds. In most cases, however, transmitting the required amount of data across the network to perform distributed computations is impractical. To reduce the amount of communication, distributed constraints monitoring or distributed trigger mechanisms are utilized. These mechanisms reduce the communication needed to perform the computations by filtering out “uninteresting” events such that they are not communicated across the network. An “uninteresting” event refers to a change in value at some remote site that does not cause a global function to exceed a threshold of interest. In many cases, however, such mechanisms do not reduce the communication volume enough to provide efficient network monitoring.

FIG. 1 illustrates a conventional distributed monitoring method utilizing what is referred to as a zero-slack scheme. In a zero-slack scheme, a central coordinator such as a network operations center s₀ assigns local constraint threshold values T_(i) to each remote site s₁, . . . , s_(n) according to Equation (1) shown below.

T_(i) = T/n, ∀i ∈ [1, n]   Equation (1)

In Equation (1), T is a global constraint threshold value for the system and n is the number of nodes or remote sites in the system. In one example, the global constraint threshold corresponds to the total number of bytes that passed through the service provider network in the past second. The method shown in FIG. 1 will be discussed with regard to the conventional system architecture shown in FIG. 2.

Referring to FIG. 1, at step S502 if remote site s_(j) (where j=1, 2, 3, . . . ) observes a value of the variable x_(j) that is greater than its assigned local constraint threshold value T_(j), the site s_(j) determines that its local constraint threshold value T_(j) has been violated. In response, the remote site s_(j) generates a local alarm transmission to notify the coordinator s₀ of the local constraint threshold violation at remote site s_(j) at step S504. The local alarm transmission also informs the coordinator s₀ of the observed value x_(j) causing the local alarm transmission. As discussed herein, variable x_(j) may be the total amount of traffic (e.g., in bytes) entering into a network through an ingress point. The variable x_(j) may also be an observed number of cars on the highway, an amount of traffic from a monitored network in a day, the volume of remote login (e.g., TELNET, FTP, etc.) requests received by hosts within the organization that originate from external hosts, packet loss at a given remote site or network node, etc.

At step S506, when the coordinator s₀ receives the local alarm transmission from site s_(j), the coordinator s₀ calculates an estimate of the global aggregate value according to Equation (2) shown below.

x_(j)+Σ_(i≠j)T_(i)   Equation (2)

In Equation (2), each local constraint T_(i) serves as an estimate of the current value of variable x_(i) at each node other than s_(j); these local constraints are known at the central coordinator s₀. At step S508, the central coordinator s₀ then determines whether Equation (3) is satisfied.

x_(j)+Σ_(i≠j)T_(i) ≦ T   Equation (3)

If Equation (3) is not satisfied, the central coordinator s₀ sends a message requesting current values of the variable x_(i) to each remote site s₁, . . . , s_(n) at step S510. This transmission of messages is referred to as a “global poll.” In response, each remote site sends an update message including the current value of the variable x_(i). Using these obtained values for variables x₁, x₂, . . . x_(n), the central coordinator s₀ determines if the global network constraint threshold T has been violated at step S512.

That is, for example, the central coordinator s₀ aggregates the values for variables x₁, x₂, . . . x_(n) and compares the aggregate value with the global constraint threshold. If the aggregate value is greater than the global constraint threshold, then the central coordinator s₀ determines that the global constraint threshold T is violated. If the central coordinator s₀ determines that the global constraint threshold T is violated, the central coordinator s₀ records the violation of the global constraint threshold in a memory at step S514. In one example, the central coordinator s₀ may generate a log, which includes the time, date, and particular values associated with the constraint threshold violation.

Returning to step S512, if the central coordinator s₀ determines that the global constraint threshold T is not violated, the process terminates and no action is taken. Returning to step S508, if the central coordinator s₀ determines that Equation (3) is satisfied, the central coordinator s₀ determines that a global poll is not necessary, the process terminates and no action is taken.

This method is an example of a zero-slack scheme in which the sum of the local thresholds T_(i) for all remote sites in the network is equal to the global constraint threshold T, or in other words,

${\sum\limits_{i = 1}^{n}T_{i}} = {T.}$

In this case, a local alarm transmission results in a global poll by the central coordinator s₀ because any violation of a local constraint threshold for any node causes the central coordinator s₀ to estimate that the global constraint threshold T is violated. Using a zero-slack scheme, however, results in relatively high communication costs due to the frequency of local alarms and global polls.
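The conventional flow of FIG. 1 can be sketched as follows. This is a minimal illustration only, assuming a hypothetical callable `poll_sites` that stands in for the global poll and returns the current values x₁, . . . , x_(n); the function names are not taken from the specification.

```python
def zero_slack_thresholds(T, n):
    """Equation (1): every remote site receives the same threshold T/n."""
    return [T / n] * n


def handle_local_alarm(j, x_j, thresholds, T, poll_sites):
    """Coordinator-side handling of a local alarm (steps S506 to S512).

    j          -- index of the site that raised the alarm
    x_j        -- value reported in the alarm
    thresholds -- local thresholds T_i known to the coordinator
    poll_sites -- hypothetical callable performing the global poll
    Returns True when the global threshold T is found to be violated.
    """
    # Equation (2): reported value plus thresholds of all other sites.
    estimate = x_j + sum(t for i, t in enumerate(thresholds) if i != j)
    if estimate <= T:
        # Equation (3) is satisfied, so no global poll is needed.
        return False
    values = poll_sites()       # step S510: global poll
    return sum(values) > T      # step S512: compare aggregate with T
```

With zero-slack thresholds every alarm makes the estimate exceed T and so triggers `poll_sites`, which is the communication cost the paragraph above describes.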

SUMMARY

Example embodiments provide methods for tracking anomalous behavior in a network referred to as non-zero slack schemes, which may reduce the number of communication messages in the network (e.g., by about 60%) necessary to monitor emerging large-scale, distributed systems using distributed computation algorithms.

In illustrative embodiments, system behavior (e.g., global polls) is determined by multiple values at the various sites, and not a single value as in the conventional art. At least one illustrative embodiment uses Markov's Inequality to obtain a simple upper bound that expresses the global poll probability as the sum of independent components, one per remote site, each involving the local variable plus constraint at the remote site. Thus, optimal local constraints (e.g., the local constraints that minimize communication costs) may be computed locally and independently by each remote site without assistance from a central coordinator.

Non-zero slack schemes according to illustrative embodiments discussed herein may result in lower communication costs.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a conventional method for distributed monitoring;

FIG. 2 is a conventional system architecture;

FIG. 3 is a flow chart illustrating a method for generating and assigning local constraints to remote sites in a system according to an illustrative embodiment;

FIG. 4 is a flow chart illustrating a method for generating a local constraint using the Markov-based algorithm according to an illustrative embodiment; and

FIG. 5 is a flow chart illustrating a method for generating a local constraint for a remote site using a reactive algorithm according to an illustrative embodiment.

DETAILED DESCRIPTION OF THE INVENTION

Illustrative embodiments are directed to methods for generating and/or assigning local constraints to nodes or remote sites within a network and methods for tracking anomalous behavior using the assigned local constraint thresholds. Anomalous behavior may be used to indicate that action is required by a network operator and/or system operations center. The methods described herein utilize non-zero slack scheme algorithms for determining local constraints that retain some slack in the system.

In the following description, illustrative embodiments will be described with reference to acts and symbolic representations of operations (e.g., in the form of flowcharts) that may be implemented as program modules or functional processes including routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types and may be implemented using existing hardware at existing central coordinators or nodes/remote sites. Such existing hardware may include one or more digital signal processors (DSPs), application-specific-integrated-circuits (ASICs), field programmable gate arrays (FPGAs), computers or the like.

Where applicable, variables or terms used in the following description refer to and are representative of the same values described above. In addition, the terms threshold and constraint may be considered synonymous and may be used interchangeably.

Unlike zero-slack schemes, in the disclosed non-zero slack schemes, each remote site is assigned a local constraint (or threshold) T_(i) such that

${{\sum\limits_{i = 1}^{n}T_{i}} \leq T},$

where T is again the global constraint threshold for the system and n is the number of nodes in the system. In such a non-zero slack scheme, the slack SL refers to the difference between the global threshold value and the sum of the remote site threshold values in the system. More particularly, the slack is given by

${SL} = {T - {\sum\limits_{i = 1}^{n}{T_{i}.}}}$
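As a small illustration of the slack definition, the following sketch computes SL for a given set of local thresholds. The function name `slack` and the sample values are illustrative assumptions.

```python
def slack(T, thresholds):
    """SL = T minus the sum of the local thresholds T_i."""
    return T - sum(thresholds)


# A zero-slack assignment leaves no slack; a non-zero slack one retains some.
zero = slack(100, [25, 25, 25, 25])
nonzero = slack(100, [20, 30, 40])
```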

Illustrative embodiments will be described herein as being implemented in the conventional system architecture of FIG. 2 discussed above. However, it will be understood that illustrative embodiments may be implemented in connection with any other network or system.

As is the case in the conventional zero-slack schemes, the global constraint may be decomposed into a set of local thresholds, T_(i), one at each remote site s_(i). Unlike the zero-slack schemes, however, in illustrative embodiments local constraint values (hereinafter local constraints) T_(i) may be generated and/or assigned such that

${\sum\limits_{i = 1}^{n}T_{i}} \leq {T.}$

In effect, generating and/or assigning local constraints T_(i) satisfying

${\sum\limits_{i = 1}^{n}T_{i}} \leq T$

filters out “uninteresting” events in the system to reduce the amount of communication overhead. As noted above, an “uninteresting” event is a change in value at some remote site that does not cause a global function to exceed a threshold of interest.

Brute-Force Algorithm

One embodiment provides a method for assigning local constraints to nodes in a system using a “brute force” algorithm. The method may be performed at the central coordinator s₀ in FIG. 2.

FIG. 3 is a flow chart illustrating a method for generating and assigning local constraints to remote sites in a system according to an illustrative embodiment. The communication between the central coordinator s₀ and each remote site s_(i) may be performed concurrently.

Referring to FIG. 3, at step S202 the central coordinator s₀ receives histogram updates in an update message. As discussed above, each site s_(i) (wherein i=1, . . . , n) observes a continuous stream of updates, which it records as a constantly changing value of its local variable x_(i). As was the case with x_(j), variable x_(i) may be the total amount of traffic (e.g., in bytes) entering into a network through an ingress point. The variable x_(i) may also be an observed number of cars on the highway, an amount of traffic from a monitored network in a day, the volume of remote login (e.g., TELNET, FTP, etc.) requests received by hosts within the organization that originate from external hosts, packet loss at a given remote site or network node, etc.

In one example, each remote site s_(i) maintains a histogram of the constantly changing value of its local variable x_(i) observed over time as H_(i)(v), ∀v ∈ [0, T], where H_(i)(v) is the probability of variable x_(i) having a value v. The update messages may be sent and received periodically, wherein the period is referred to as the recompute interval.

At step S204, in response to receiving the update messages from the remote sites, the central coordinator s₀ generates (calculates) local constraints T_(i) for each remote site s_(i). The central coordinator s₀ may generate local constraints T_(i) based on a total system cost C as will be described in more detail below.

In one example, the coordinator s₀ first calculates a probability P_(l)(i) of a local alarm for each individual remote site (hereinafter local alarm probability) according to Equation (4) shown below.

$\begin{matrix}{{P_{l}(i)} = {{\Pr \left( {x_{i} > T_{i}} \right)} = {1 - {\sum\limits_{j = 0}^{T_{i}}{H_{i}(j)}}}}} & {{Equation}\mspace{14mu} (4)}\end{matrix}$

In Equation (4), Pr(x_(i)>T_(i)) is the probability that the observed value at remote site s_(i) is greater than its threshold T_(i) and is independently calculated for a given local constraint T_(i). Thus, the local alarm probability P_(l)(i) is entirely independent of the state of the other remote sites. In other words, the local alarm probability P_(l)(i) for each remote site s_(i) is independent of values of variable x_(i) at other remote sites in the system.
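Equation (4) can be evaluated directly from a site's histogram. The sketch below assumes the histogram is stored as a list `hist` in which `hist[v]` holds H_(i)(v) for integer values v = 0, . . . , T; that storage layout and the function name are assumptions made for illustration.

```python
def local_alarm_probability(hist, T_i):
    """Equation (4): P_l(i) = 1 - sum_{j=0}^{T_i} H_i(j).

    hist[v] is H_i(v), the observed probability that x_i equals v.
    """
    return 1.0 - sum(hist[: T_i + 1])
```

Only the site's own histogram and threshold appear in the computation, which reflects the independence from other sites noted above.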

In addition to determining a local alarm probability for each remote site, the central coordinator s₀ determines a probability P_(g) of a global poll (hereinafter referred to as a global poll probability) in the system according to Equation (5) shown below:

$\begin{matrix}{P_{g} = {{\Pr \left( {Y > T} \right)} = {1 - {\sum\limits_{v = 0}^{T}{\Pr \left( {Y = v} \right)}}}}} & {{Equation}\mspace{14mu} (5)}\end{matrix}$

In Equation (5), Y=Σ_(i)Y_(i), and Y_(i) is an estimated value for x_(i) at each remote site s_(i) in the system. The estimated values Y_(i) are stored at the coordinator s₀ such that Y_(i)≧x_(i) at all times. The central coordinator s₀ updates the stored values Y_(i) based on values x_(i) reported in local alarms from each remote site. In a more specific example, the coordinator s₀ receives updates for values x_(i) at remote site s_(i) via a local alarm message generated by remote site s_(i) once the observed value x_(i) exceeds its local constraint T_(i). The stored values Y_(i) at the central coordinator s₀ for each remote site may be summarized as:

$Y_{i} = \begin{cases} x_{i} & \text{for each } s_{i} \text{ that reports a local alarm; and} \\ T_{i} & \text{for each } s_{i} \text{ that has not reported anything.} \end{cases}$

Still referring to Equation (5), Pr(Y=v) is the probability that Y=v, where v is a constant, which may be chosen by a network operator. The central coordinator s₀ computes the probability Pr(Y=v) using a dynamic programming algorithm with pseudo-polynomial time complexity of O(nT²). As is well-known, O(nT²) is a standard notation indicating the running time of an algorithm. Unlike the local alarm probability P_(l), the global poll probability P_(g) is dependent on the state of all remote sites in the system. In other words, the global poll probability P_(g) is dependent on values of variable x_(i) at other remote sites in the system.
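One way to realize the dynamic programming computation is to convolve the per-site distributions of Y_(i) while discarding any probability mass that has already exceeded T. The sketch below is an illustration under stated assumptions, not the specific algorithm of the embodiment: it assumes the sites are statistically independent, that each histogram is a list indexed by integer values 0, . . . , T, and all function names are hypothetical.

```python
def estimate_distribution(hist, T_i):
    """Distribution of Y_i: equal to T_i when x_i <= T_i, else x_i itself."""
    dist = [0.0] * len(hist)
    dist[T_i] = sum(hist[: T_i + 1])       # no alarm: coordinator assumes T_i
    for v in range(T_i + 1, len(hist)):    # alarm: coordinator learns x_i = v
        dist[v] = hist[v]
    return dist


def global_poll_probability(hists, thresholds, T):
    """Equation (5): P_g = 1 - sum_{v=0}^{T} Pr(Y = v), with Y = sum_i Y_i."""
    total = [1.0] + [0.0] * T              # Pr(partial sum = v), v in 0..T
    for hist, T_i in zip(hists, thresholds):
        dist = estimate_distribution(hist, T_i)
        nxt = [0.0] * (T + 1)
        for a, pa in enumerate(total):
            if pa == 0.0:
                continue
            for b, pb in enumerate(dist):
                if a + b <= T:             # mass above T can never return
                    nxt[a + b] += pa * pb
        total = nxt
    return 1.0 - sum(total)
```

Each of the n sites contributes one pass over at most (T+1)×(T+1) pairs, which matches the O(nT²) bound stated above.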

Still referring to step S204 of FIG. 3, the central coordinator s₀ generates the local threshold T_(i) for remote site s_(i) based on the total system cost C given by Equation (6) shown below.

$\begin{matrix}{C = {{P_{g}C_{g}} + {\sum\limits_{i = 1}^{n}{{P_{l}(i)}C_{l}}}}} & {{Equation}\mspace{14mu} (6)}\end{matrix}$

In Equation (6), P_(l)(i) is the local alarm probability at site s_(i), P_(g) is the global poll probability, C_(l) is the cost of a local alarm transmission message from remote site s_(i) to the coordinator s₀ and C_(g) is the cost of performing a global poll by the central coordinator s₀. Typically, C_(l) is O(1) and C_(g) is O(n), where O(1) and O(n) differ by orders of magnitude. In one example, O(1) is a constant independent of the size of the system and O(n) is a quantity that grows linearly with the size of the system.

For instance, if there are 1000 remote sites in the system, then C_(l) may be a first value (e.g., 10) and C_(g) another value (e.g., 100). As the network increases in size (e.g., by adding another 9000 nodes), C_(l) remains close to 10, but C_(g) grows to much more than 100. As such, C_(g) grows much faster than C_(l) as network size increases.

More specifically, the central coordinator s₀ generates local constraints T_(i) for each remote site s_(i) to minimize the total system cost C.

In one example, the central coordinator s₀ performs a naive exhaustive enumeration of all T^(n) possible sets of local threshold values to generate the local constraints at each remote site that result in minimum total system cost C. For each combination of threshold values, the local alarm probability P_(l)(i) at each remote site s_(i) and the global poll probability P_(g) value are calculated to determine the total system cost C. In this case, this naive enumeration has a running time of O(nT^(n+2)).

To reduce the running time, only local threshold values in the range [T_(i)−δ, T_(i)+δ] for a small constant δ may be considered. The small constant δ may be determined experimentally and assigned, for example, by a network operator at a network operations center.
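The exhaustive search can be sketched as below for very small systems. The sketch enumerates every threshold combination whose sum does not exceed T and evaluates Equation (6) for each. It assumes independent sites, list-based histograms indexed 0, . . . , T, and hypothetical function names; it omits the [T_(i)−δ, T_(i)+δ] restriction, which would simply narrow each `range`.

```python
from itertools import product


def p_local(hist, T_i):
    """Equation (4): local alarm probability for threshold T_i."""
    return 1.0 - sum(hist[: T_i + 1])


def p_global(hists, thresholds, T):
    """Equation (5), computed by convolving the Y_i distributions."""
    total = [1.0] + [0.0] * T
    for hist, T_i in zip(hists, thresholds):
        dist = [0.0] * len(hist)
        dist[T_i] = sum(hist[: T_i + 1])
        dist[T_i + 1:] = hist[T_i + 1:]
        nxt = [0.0] * (T + 1)
        for a, pa in enumerate(total):
            for b, pb in enumerate(dist):
                if a + b <= T:
                    nxt[a + b] += pa * pb
        total = nxt
    return 1.0 - sum(total)


def brute_force_thresholds(hists, T, C_l, C_g):
    """Return (thresholds, cost) minimizing Equation (6) with sum(T_i) <= T."""
    best_cost, best = float("inf"), None
    for cand in product(range(T + 1), repeat=len(hists)):
        if sum(cand) > T:
            continue
        cost = C_g * p_global(hists, cand, T) + sum(
            C_l * p_local(h, t) for h, t in zip(hists, cand)
        )
        if cost < best_cost:
            best_cost, best = cost, cand
    return best, best_cost
```

The number of candidates grows exponentially with the number of sites, which is why the narrower search range described above is of practical interest.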

Returning to FIG. 3, at step S206, the central coordinator s₀ sends each generated local constraint T_(i) to its corresponding remote site s_(i).

Markov-Based Algorithm

Another illustrative embodiment provides a method for generating local constraints using a Markov-based algorithm. This embodiment uses Markov's inequality to approximate the global poll probability P_(g), resulting in a decentralized algorithm in which each site s_(i) may independently determine its own local constraint T_(i). As is well-known, in probability theory, Markov's inequality gives an upper bound for the probability that a non-negative function of a random variable is greater than or equal to some positive constant.

FIG. 4 is a flow chart illustrating a method for generating a local constraint using the Markov-based algorithm according to an illustrative embodiment. As noted above, the method shown in FIG. 4 may be performed at each individual remote site in the system.

Referring to FIG. 4, at step S302, using Markov's inequality, remote site s_(i) approximates a global poll probability P_(g) according to Equation (7) shown below.

$\begin{matrix}{P_{g} = {{{\Pr \left( {Y > T} \right)} \leq \frac{E\lbrack Y\rbrack}{T}} = {\frac{E\left\lbrack {\sum\limits_{i = 1}^{n}Y_{i}} \right\rbrack}{T} = \frac{\sum\limits_{i = 1}^{n}{E\left\lbrack Y_{i} \right\rbrack}}{T}}}} & {{Equation}\mspace{14mu} (7)}\end{matrix}$

The approximation of the global poll probability P_(g) obtained by the remote site s_(i) represents the upper bound on the global poll probability P_(g). Using this upper bound, at step S304, the remote site s_(i) estimates the total system cost C using Equation (8) shown below.

$\begin{matrix}{{C = {{{\sum\limits_{i = 1}^{n}{C_{l}{P_{l}(i)}}} + {C_{g}P_{g}}} \leq {{\sum\limits_{i = 1}^{n}{C_{l}{P_{l}(i)}}} + {\frac{C_{g}}{T}{\sum\limits_{i = 1}^{n}{E\left\lbrack Y_{i} \right\rbrack}}}}}}{C \leq {\sum\limits_{i = 1}^{n}\left( {{C_{l}{P_{l}(i)}} + {\frac{C_{g}}{T}{E\left\lbrack Y_{i} \right\rbrack}}} \right)}}} & {{Equation}\mspace{14mu} (8)}\end{matrix}$

In Equations (7) and (8), the remote site's estimated individual contribution to the total system cost, E[Y_(i)], is given by Equation (9) shown below.

$\begin{matrix}{{E\left\lbrack Y_{i} \right\rbrack} = {{\sum\limits_{v = 0}^{T}{Y_{i}{\Pr \left( {Y_{i} = v} \right)}}} = {{\sum\limits_{v = 0}^{T_{i}}{T_{i}{H_{i}(v)}}} + {\sum\limits_{v = {T_{i} + 1}}^{T}{{vH}_{i}(v)}}}}} & {{Equation}\mspace{14mu} (9)}\end{matrix}$

In Equation (9), Pr(Y_(i)=v) is the probability that the estimated value Y_(i) has the value v.

Referring back to FIG. 4, at step S306 the remote site s_(i) independently determines the local constraint T_(i) based on its estimated individual contribution E[Y_(i)] to the estimated total system cost C given by Equation (8). More specifically, for example, the remote site s_(i) independently calculates the local constraint T_(i) that minimizes its contribution to the estimated total system cost C, thus allowing the remote site s_(i) to calculate its local constraint T_(i) independent of the coordinator s₀.

The remote site s_(i) may calculate its local constraint T_(i) by performing a linear search in the range 0 to T. Because such a search requires O(T) running time, the running time may be reduced to O(δ) by searching for the optimal threshold value in a small range [T_(i)−δ, T_(i)+δ]. The linear search performed by the remote site s_(i) may be performed at least once during each round or recompute interval. Each time remote site s_(i) recalculates its local constraint T_(i), the remote site s_(i) reports the newly calculated local constraint to the central coordinator s₀ via an update message.
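The per-site search can be sketched as follows. Each site minimizes its own term of the bound in Equation (8), namely C_(l)P_(l)(i) + (C_(g)/T)E[Y_(i)], by a linear scan. The list-based histogram layout, the function names, and the optional `lo`/`hi` arguments (standing in for the range [T_(i)−δ, T_(i)+δ]) are illustrative assumptions.

```python
def expected_estimate(hist, T_i):
    """Equation (9): E[Y_i] = T_i * Pr(x_i <= T_i) + sum_{v > T_i} v * H_i(v)."""
    below = T_i * sum(hist[: T_i + 1])
    above = sum(v * hist[v] for v in range(T_i + 1, len(hist)))
    return below + above


def local_cost(hist, T_i, T, C_l, C_g):
    """Site i's term in the upper bound of Equation (8)."""
    p_l = 1.0 - sum(hist[: T_i + 1])
    return C_l * p_l + (C_g / T) * expected_estimate(hist, T_i)


def markov_threshold(hist, T, C_l, C_g, lo=0, hi=None):
    """Linear search for the threshold in [lo, hi] with the smallest local cost."""
    hi = T if hi is None else hi
    return min(range(lo, hi + 1), key=lambda t: local_cost(hist, t, T, C_l, C_g))
```

Raising T_(i) lowers the local alarm term but raises E[Y_(i)], so the minimum reflects the trade-off between local alarms and global polls.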

If each remote site in the system independently determines its own local threshold value, there is no guarantee that

${\sum\limits_{i = 1}^{n}T_{i}} \leq T$

is satisfied. To ensure that

${\sum\limits_{i = 1}^{n}T_{i}} \leq T$

is satisfied, each remote site's local constraint may be restricted to a maximum of T/n by the central coordinator s₀. However, such a restriction may reduce performance in cases where one site's value is very high on average compared to other sites.

Alternatively, to ensure that the sum of the threshold values is bounded by T, the coordinator s₀ may determine if

${\sum\limits_{i = 1}^{n}T_{i}} \leq T$

is satisfied each recompute interval after having received update messages from the remote sites. If the central coordinator s₀ determines that

${\sum\limits_{i = 1}^{n}T_{i}} \leq T$

is not satisfied, the coordinator s₀ may reduce each threshold value T_(j) by

$\frac{T_{j}}{\sum\limits_{i = 1}^{n}T_{i}}\left( {{\sum\limits_{i = 1}^{n}T_{i}} - T} \right)$

such that

${\sum\limits_{i = 1}^{n}T_{i}} \leq T$

is satisfied.
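The proportional reduction can be sketched as below: each threshold gives up a share of the excess equal to its share of the current total. The function name is an illustrative assumption.

```python
def rescale_thresholds(thresholds, T):
    """Reduce each T_j by (T_j / sum(T_i)) * (sum(T_i) - T) when the sum exceeds T."""
    total = sum(thresholds)
    if total <= T:
        return list(thresholds)      # constraint already holds; leave unchanged
    excess = total - T
    return [t - (t / total) * excess for t in thresholds]
```

After the reduction the thresholds sum to T (up to floating-point rounding) and keep their relative proportions.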

Reactive Algorithm

Another illustrative embodiment provides a method for generating local constraints using what is referred to herein as a “reactive algorithm.” The method for generating local constraints using the reactive algorithm may be performed at each remote site individually or at a central location such as central coordinator s₀.

If the method according to this illustrative embodiment is performed at individual remote sites, then each remote site reports the newly calculated local constraint to the central coordinator in an update message during each recompute interval. If the method according to this illustrative embodiment is performed at the central coordinator s₀, then the central coordinator s₀ assigns and sends the newly calculated local constraint to each remote site during each recompute interval. As noted above, the central coordinator s₀ and the remote sites may communicate in any well-known manner.

As was the case with the above-discussed embodiments, this embodiment will be described with regard to FIG. 2, in particular, with the method being executed at remote site s_(i).

In this embodiment, the remote site s_(i) determines its own local constraint T_(i) based on actual local alarm and global poll events within the system.

FIG. 5 is a flow chart illustrating a method for generating a local constraint for a remote site using a reactive algorithm according to an illustrative embodiment.

Referring to FIG. 5, at step S402 the remote site s_(i) generates an initial local constraint T_(i), for example, using the above-described Markov-based algorithm. At step S404, the remote site s_(i) then adjusts the local constraint T_(i) based on actual global poll and local alarm events in the system.

For example, each time the remote site s_(i) transmits a local alarm, the remote site s_(i) determines that the local constraint T_(i) may be lower than an optimal value. In this case, the remote site s_(i) may increase its local constraint T_(i) value by a factor α with a probability 1/ρ_(i) (or 1, if 1/ρ_(i) is greater than 1), where α and ρ_(i) are parameters of the system greater than 0. In other words, the local constraint at remote site s_(i) is not always increased in response to generating a local alarm, but rather is increased probabilistically. In one example, system parameter α is a constant selected by a network operator at the network operations center and is indicative of the rate of convergence. In one example, α may take values between about 1 and about 1.2, inclusive (e.g., α=1.1). Parameter ρ_(i) is computed according to Equation (10) discussed in more detail below.

Each time the remote site s_(i) receives a global poll, which is not generated in response to a self-generated local alarm, the remote site s_(i) determines that its local constraint T_(i) may be higher than an optimal value. In this case, the remote site s_(i) may reduce the threshold value by a factor of α with a probability ρ_(i) (or 1, if ρ_(i) is greater than 1). In other words, the local constraint at remote site s_(i) is not always decreased in response to a global poll, but rather is decreased probabilistically.

As noted above, to obtain a more optimal local threshold T_(i) ^(opt), parameter ρ_(i) may be set according to Equation (10) shown below.

$\begin{matrix}{\rho_{i} = \frac{P_{l}\left( T_{i}^{opt} \right)}{P_{g}^{opt}}} & {{Equation}\mspace{14mu} (10)}\end{matrix}$

In Equation (10), probability P_(l)(T_(i) ^(opt)) is the local alarm probability when the local threshold is set to T_(i) ^(opt) and the probability P_(g) ^(opt) is the global poll probability when all remote sites take the optimal local constraint values.
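The probabilistic adjustments and Equation (10) can be sketched as follows. The sketch interprets "by a factor α" as multiplying (on a local alarm) or dividing (on a global poll) the threshold by α, which is an assumption about the intended update; the function names and the injectable `rng` argument are likewise hypothetical.

```python
import random


def rho(p_local_opt, p_global_opt):
    """Equation (10): rho_i = P_l(T_i^opt) / P_g^opt."""
    return p_local_opt / p_global_opt


def on_local_alarm(T_i, alpha, rho_i, rng=random):
    """Raise T_i by a factor alpha with probability min(1, 1/rho_i)."""
    if rng.random() < min(1.0, 1.0 / rho_i):
        return T_i * alpha
    return T_i


def on_global_poll(T_i, alpha, rho_i, rng=random):
    """Lower T_i by a factor alpha with probability min(1, rho_i).

    Intended for polls not caused by this site's own local alarm.
    """
    if rng.random() < min(1.0, rho_i):
        return T_i / alpha
    return T_i
```

Passing an object with a `random()` method as `rng` makes the behavior repeatable, which is convenient when checking how the threshold drifts under a given sequence of events.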

Equation (10) can be shown to be a valid value for ρ_(i) because if each remote site s_(i) does not have an optimal local constraint T_(i) ^(opt), then either (A) the current local constraint T_(i)′>T_(i) ^(opt), P_(l)(T_(i)′)<P_(l)(T_(i) ^(opt)) and P_(g)(T_(i)′)>P_(g)(T_(i) ^(opt)), or (B) the current local constraint T_(i)′<T_(i) ^(opt), P_(l)(T_(i)′)>P_(l)(T_(i) ^(opt)) and P_(g)(T_(i)′)<P_(g)(T_(i) ^(opt)).

In case (A), if T_(i)′>T_(i) ^(opt), P_(l)(T_(i)′)<P_(l)(T_(i) ^(opt)) and P_(g)(T_(i)′)>P_(g)(T_(i) ^(opt)) at site s_(i), then

$\frac{P_{l}\left( T_{i}^{\prime} \right)}{P_{g}\left( T_{i}^{\prime} \right)} < \frac{P_{l}\left( T_{i}^{opt} \right)}{P_{g}\left( T_{i}^{opt} \right)}$

and P_(l)(T_(i)′)<ρ_(i)P_(g)(T_(i)′). In this case, the average number of observed local alarms is less than ρ_(i) times the average number of observed global polls. Thus, the local constraint value decreases over time from T_(i)′.

In case (B), if T_(i)′<T_(i) ^(opt), P_(l)(T_(i)′)>P_(l)(T_(i) ^(opt)) and P_(g)(T_(i)′)<P_(g)(T_(i) ^(opt)) at site s_(i), then

$\frac{P_{l}\left( T_{i}^{\prime} \right)}{P_{g}\left( T_{i}^{\prime} \right)} > \frac{P_{l}\left( T_{i}^{opt} \right)}{P_{g}\left( T_{i}^{opt} \right)}$

and P_(l)(T_(i)′)>ρ_(i)P_(g)(T_(i)′). Similarly, the threshold value will increase over time if the threshold is less than T_(i) ^(opt).

Given the above discussion, one will appreciate that the stable state of the system is reached when local constraints are optimized (e.g., T_(i) ^(opt)) using the reactive algorithm. Once the system reaches a stable state (at the optimal setting of local constraints), the communication overhead is minimized compared to all other states.

In an alternative embodiment, the remote site s_(i) may utilize the Markov-based method to determine the local constraint T_(i) that minimizes the total system cost C and use this value to compute the contribution of the remote site to P_(g).

In this embodiment, the remote site s_(i) sends its individual estimated contribution E[Y_(i)] to P_(g) to the central coordinator s₀ at least once during or at the end of each recompute interval. The central coordinator s₀ sums (or aggregates) the components of P_(g) received from the remote sites and computes the P_(g) value. The coordinator s₀ sends this value of P_(g) to each remote site, and each remote site uses this received value of P_(g) to compute parameter ρ_(i). Illustrative embodiments use an estimate of P_(g) provided by the central coordinator s₀ to compute ρ_(i) at each remote site. The remaining portions of information necessary are available locally at each remote site.

The above-discussed embodiments may be used to generate and/or assign local thresholds to remote sites in the system of FIG. 2, for example. Using these assigned local thresholds, methods for distributed monitoring may be performed more efficiently and system costs may be reduced. In one example, the local thresholds determined according to illustrative embodiments may be utilized in the distributed monitoring method discussed above with regard to FIG. 1.

In a more specific example, illustrative embodiments may be used to monitor the total amount of traffic flowing into a service provider network. In this example, the monitoring setup includes acquiring information about ingress traffic of the network. This information may be derived by deploying passive monitors at each link or by collecting flow information (e.g., Netflow records) from the ingress routers (remote sites). Each monitor determines the total amount of traffic (e.g., in bytes) coming into the network through that ingress point. If the total amount of traffic exceeds a local constraint assigned to that ingress point, the monitor generates a local alarm. A network operations center may then perform a global poll of the system, and determine whether the total traffic across the system violates a global threshold, that is, a maximum total traffic through the network.

In another example, illustrative embodiments discussed herein may be used to detect service quality degradations of VoIP sessions in a network. For example, assume that VoIP requires the end-to-end delay to be within 200 milliseconds and the loss probability to be within 1%. Also, assume a path through the network with n network elements (e.g., routers, switches). To monitor loss probabilities through the network, each network element uses an estimate of its local loss probability, for example, l_(i), i ∈ [1, n], and an estimate of the loss probability L of the path through these network elements given by L=1−(1−l₁)(1−l₂) . . . (1−l_(n)), which re-arranges into log(1−L)=log(1−l₁)+log(1−l₂)+ . . . +log(1−l_(n)). If a loss probability of at most 0.01 is desired (e.g., L≦0.01), then log(1−L)≧log(0.99). Inverting the sign on both sides, this transforms into the constraint

$\sum_{i=1}^{n} \left( -\log\left(1 - l_{i}\right) \right) \leq -\log(0.99).$

In terms of the above-described illustrative embodiments, −log(1−l_(i)) is local constraint T_(i) and −log(0.99) is global constraint T. Thus, the losses may be monitored in a network using distributed constraints monitoring. Delays can be monitored similarly using distributed SUM constraints.
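The transformation of the multiplicative path-loss bound into a SUM constraint can be checked numerically. The sketch below is illustrative only; the function names and the sample loss values are assumptions made for this example and do not appear in the embodiments.

```python
import math
from typing import Sequence


def sum_term(loss_prob: float) -> float:
    """Per-element term -log(1 - l_i) that enters the SUM constraint."""
    return -math.log(1.0 - loss_prob)


def path_loss(losses: Sequence[float]) -> float:
    """Path loss L = 1 - (1 - l_1)(1 - l_2)...(1 - l_n)."""
    delivered = 1.0
    for loss in losses:
        delivered *= 1.0 - loss
    return 1.0 - delivered


def global_bound(max_path_loss: float = 0.01) -> float:
    """Global bound -log(1 - max_path_loss); -log(0.99) for a 1% loss target."""
    return -math.log(1.0 - max_path_loss)


def violates(losses: Sequence[float], max_path_loss: float = 0.01) -> bool:
    """True if the summed per-element terms exceed the global bound."""
    return sum(sum_term(loss) for loss in losses) > global_bound(max_path_loss)
```

Under these assumptions, the additive test agrees with a direct comparison of the path loss L against 0.01 whenever L is not within floating-point rounding of the bound, which is what allows each network element to contribute its own term to a distributed SUM.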

In a similar manner, illustrative embodiments may be used to: raise an alert when the total number of cars on the highway exceeds a given number and report the number of vehicles detected; identify all destinations that receive more than a given amount of traffic from a monitored network in a day and report their transfer totals; monitor the volume of remote login (e.g., TELNET, FTP, etc.) requests received by hosts within the organization that originate from external hosts; etc.

The invention being thus described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the invention, and all such modifications are intended to be included within the scope of the invention.

1. A method for assigning a local constraint to a remote site in a network, the method comprising: generating, by a central controller, the local constraint for the remote site based on probabilities and system costs associated with a local alarm transmission by the remote site and a global poll in the network, the local constraint being generated in response to an update message received from at least one remote site in the network; assigning the local constraint to the remote site.
2. The method of claim 1, further comprising: calculating the probability of a local alarm transmission by the remote site based on a histogram update received from the remote site, the histogram update being indicative of current observation values at the remote site.
3. The method of claim 1, further comprising: calculating the probability of a global poll based on an aggregate of estimated observation values for a plurality of remote sites in the network.
4. The method of claim 1, wherein the generating step further comprises: estimating a total system cost associated with local alarm transmissions and global probabilities in the network based on the probabilities and system costs associated with the local alarm transmission by the remote site and probabilities and system costs associated with a global poll in the network; and wherein the generating step generates the local constraint based on the estimated total system cost.
5. The method of claim 1, further comprising: transmitting the assigned local constraint to the remote site.
6. The method of claim 5, further comprising: detecting, by the remote site, violation of the local constraint based on a current instantaneous observation value; and generating a local alarm in response to the detected violation.
7. The method of claim 6, wherein the detecting step comprises: comparing a current observation value with the local constraint; and detecting violation of the local constraint if the current observation value is greater than the local constraint.
8. The method of claim 6, further comprising: detecting, by the central controller, violation of a global constraint in response to the generated local alarm.
9. A method for generating a local network constraint value for a remote site in the network, the method comprising: estimating, locally at the remote site, a total system cost based on probabilities and system costs associated with a local alarm and global polling of remote sites in the network; and generating a local constraint based on the estimated total system cost such that the local constraint value is less than a maximum local constraint value, the maximum local constraint value being determined based on a number of nodes in the network and a global constraint for the network.
10. The method of claim 9, further comprising: approximating, at the remote site, a probability of a global poll in the network based on a sum of expected system cost contributions of remote sites in the network and the global constraint; and wherein the estimating step estimates the total system cost based on the probability of the global poll in the network.
11. The method of claim 9, further comprising: detecting, by the remote site, violation of the local constraint based on a current observation value; and generating a local alarm in response to the detected violation.
12. The method of claim 11, wherein the detecting step comprises: comparing the current observation value with the local constraint; and detecting violation of the local constraint if the current observation value is greater than the local constraint.
13. The method of claim 11, further comprising: detecting, by the central controller, violation of a global constraint in response to the generated local alarm.
14. A method for adaptively assigning a local constraint to a remote site in a network, the method comprising: generating a local constraint based on an estimated total system cost, the estimated total system cost being indicative of costs associated with local alarm transmissions and global polling of the network; approximating a probability of a global poll in the network based on a sum of expected system cost contributions of the remote site and the generated global constraint; and probabilistically adjusting a local constraint value at the remote site in the network by a first factor in response to a local alarm or global poll event in the system.
15. The method of claim 14, wherein the adjusting step further comprises: probabilistically increasing a local network constraint for a first node in response to a local alarm generated by the remote site; or probabilistically decreasing local network constraint values for at least a portion of the nodes in the network in response to a global poll event.
16. The method of claim 14, further comprising: detecting, by the remote site, violation of the local constraint based on a current observation value; and generating a local alarm in response to the detected violation.
17. The method of claim 16, wherein the detecting step comprises: comparing the current observation value with the local constraint; and detecting violation of the local constraint if the current observation value is greater than the local constraint.
18. The method of claim 16, further comprising: detecting, by the central controller, violation of a global constraint in response to the generated local alarm.